arXiv:1502.04742v1 [stat.ML] 16 Feb 2015


On the Predictive Properties of 
Binary Link Functions 

Necla Gunduz 
Department of Statistics 

Gazi Üniversitesi, Fen Fakültesi, İstatistik Bölümü
06500 Teknikokullar, Ankara, Turkey 
ngunduz@gazi.edu.tr 


Ernest Fokoue* 

Center for Quality and Applied Statistics 
Rochester Institute of Technology 
98 Lomb Memorial Drive, Rochester, NY 14623, USA 
ernest.fokoue@rit.edu 


Abstract 

This paper provides a theoretical and computational justification of the long-held claim of the
similarity of the probit and logit link functions often used in binary classification. Despite the
widespread recognition of the strong similarities between these two link functions, very few (if any)
researchers have dedicated time to a formal study aimed at firmly establishing and characterizing all
aspects of their similarities and differences. This paper proposes a definition of both structural and
predictive equivalence of link-function-based binary regression models, and explores the various ways
in which they are either similar or dissimilar. From a predictive analytics perspective, it turns out
that not only are probit and logit perfectly predictively concordant, but other link functions such as
cauchit and complementary log-log also enjoy a very high percentage of predictive equivalence.
Throughout this paper, simulated and real-life examples demonstrate all the equivalence results that
we prove theoretically.

I. Introduction 

Given $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i^\top = (x_{i1}, \ldots, x_{ip})$ denotes the
$p$-dimensional vector of characteristics and $y_i \in \{0,1\}$ denotes the binary response variable,
binary regression seeks to model the relationship between $x$ and $y$ using

$$\pi(x_i) = \Pr[Y_i = 1 \mid x_i] = F(\eta(x_i)) \qquad (1)$$

where

$$\eta(x_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = \tilde{x}_i^\top \beta \qquad (2)$$

with $\tilde{x}_i = (1, x_{i1}, \ldots, x_{ip})^\top$, for a $(p+1)$-dimensional vector
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ of regression coefficients, and $F(\cdot)$ is the
cdf corresponding to the link function under consideration. Specifically, the cdf $F(\cdot)$ is the
inverse of the link function $g(\cdot)$, such that
$\eta(x_i) = F^{-1}(\pi(x_i)) = g(\pi(x_i)) = g(E(Y_i \mid x_i))$. Table 1 provides specific
definitions of the link functions considered in this paper, along with their corresponding cdfs.

* Corresponding Author


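Model      Link function $g(v)$                 cdf $F(u)$

Probit     $\Phi^{-1}(v)$                       $\Phi(u)$
Compit     $\log[-\log(1 - v)]$                 $1 - e^{-e^{u}}$
Cauchit    $\tan(\pi v - \pi/2)$                $\frac{1}{\pi}\left[\tan^{-1}(u) + \frac{\pi}{2}\right]$
Logit      $\log\left[\frac{v}{1 - v}\right]$   $\Lambda(u) = \frac{1}{1 + e^{-u}}$

Table 1: Link functions along with corresponding cdfs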
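As a quick illustration of Table 1, the four cdfs and three of the link functions can be written down directly. This is a minimal sketch using only the Python standard library; the function names are hypothetical and chosen for this example, and Python is used here purely for illustration.

```python
import math

# cdfs F(u) from Table 1 (function names are illustrative, not from the paper)
def probit_cdf(u):
    """Standard normal cdf Phi(u), computed through the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def compit_cdf(u):
    """Complementary log-log cdf 1 - exp(-exp(u))."""
    return 1.0 - math.exp(-math.exp(u))

def cauchit_cdf(u):
    """Standard Cauchy cdf (1/pi) * (atan(u) + pi/2)."""
    return (math.atan(u) + math.pi / 2.0) / math.pi

def logit_cdf(u):
    """Logistic cdf 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

# link functions g(v) = F^{-1}(v) for the three closed-form cases
def compit_link(v):
    return math.log(-math.log(1.0 - v))

def cauchit_link(v):
    return math.tan(math.pi * v - math.pi / 2.0)

def logit_link(v):
    return math.log(v / (1.0 - v))

# Each link should undo its cdf: g(F(u)) = u for moderate u.
for u in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(u, probit_cdf(u), compit_cdf(u), cauchit_cdf(u), logit_cdf(u))
```

The probit link $\Phi^{-1}$ has no closed form and is omitted from the sketch; for very large $|u|$ the exponentials may overflow, which is ignored here.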

The above link functions have been used extensively in a wide variety of applications in fields as
diverse as medicine, engineering, economics, psychology and education, to name just a few. The logit
link function, for which

$$\pi(x_i) = \Pr[Y_i = 1 \mid x_i] = \Lambda(\eta(x_i)) = \frac{1}{1 + e^{-\eta(x_i)}} \qquad (3)$$

is the most commonly used of all of them, probably because it provides a nice interpretation
of the regression coefficients in terms of the ratio of the odds. The popularity of the logit link
also comes from its computational convenience, in the sense that its model formulation yields
simpler maximum likelihood equations and faster convergence. In fact, the literature on both
the theory and applications based on the logistic distribution is so vast that it would be unthinkable
to reference even a fraction of it. Some recent authors like Zelterman (1989), Schumacher et al.
(1996), Nadarajah (2004), Lin and Hu (2008) and Nassar and Elmasry (2012) provide extensive
studies on the characteristics of generalized logistic distributions, answering the ever
increasing interest in the logistic family of distributions. Indeed, applications abound that make
use of both the standard logistic regression model and the so-called generalized logistic regression
model, as can be seen in van den Hout et al. (2007) and Tamura and Giampaoli (2013). The
probit link, for which


$$\pi(x_i) = \Pr[Y_i = 1 \mid x_i] = F(\eta(x_i)) = \Phi(\eta(x_i))
  = \int_{-\infty}^{\eta(x_i)} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz \qquad (4)$$

is the second most commonly used of all the link functions, with Bayesian researchers seemingly
topping the charts in its use. See Basu and Mukhopadhyay (2000), Csato et al. (2000) and
Chakraborty (2009) for a few examples of probit use in binary classification in the Bayesian
setting. Armagan and Zaretzki (2011) is just another one of the references pointing to the use of
the probit link function in the statistical data mining and machine learning communities.

In the presence of so many possible choices of link functions, the natural question to ask is:
how does one go about choosing the right/suitable/appropriate link function for the problem
at hand? Most experts and non-experts alike who deal with binary classification tend to choose
the logit link almost automatically, to the point that the logit link has almost been attributed
a transcendental place. From experience, experimentation and mathematical proof, it is our
view, a view shared by Feller (1971) and Feller (1940), that all these link functions are equivalent,
both structurally and predictively. Indeed, our conjectured equivalence of binary regression link
functions is strongly supported by William Feller in his vehement criticism of the overuse of
the logit link function and of the tendency to give it a place above the rest of the existing link
functions. In Feller (1971)'s own words: An unbelievably huge literature tried to establish a
transcendental "law of logistic growth"; measured in appropriate units, practically all growth
processes were supposed to be represented by a function of the form [the logistic function] with t
representing time. Lengthy tables, complete with chi-square tests, supported this thesis for human
populations, for bacterial colonies, development of railroads, etc. Both height and weight of plants
and animals were found to follow the logistic law even though it is theoretically clear that these
two variables cannot be subject to the same distribution. Laboratory experiments on bacteria
showed that not even systematic disturbances can produce other results. Population theory relied on
logistic extrapolations (even though they were demonstrably unreliable). The only trouble with the
theory is that not only the logistic distribution but also the normal, the Cauchy, and other
distributions can be fitted to the same material with the same or better goodness of fit. In this
competition the logistic distribution plays no distinguished role whatever; most contradictory
theoretical models can be supported by the same observational material.


As a matter of fact, it is obvious from the plot of their densities, for instance, that the probit
and logit are virtually identical, almost superposed one on top of the other. It is therefore not
surprising that one would empirically notice virtually no difference when the two are compared on
the same binary regression task. Despite this apparent indistinguishability due to many of their
similarities, it is fair to recognize that the two functions are different, at least by definition
and by their very algebra. Chambers and Cox (1967) argue in their paper that probit and logit will
yield different results in the multivariate context. Their work is a rarity in a context where most
researchers seem to have settled comfortably on accepting that the two links are essentially the
same from a utility perspective. For such researchers, the choice of one over the other is
determined solely by mathematical convenience and taste. We demonstrate both theoretically and
computationally that they are all predictively equivalent in the univariate case, but we also
provide a characterization of the conditions under which they tend to differ in the multivariate
context.



Figure 1: Densities corresponding to the link functions. The similarities are around the center of the
distributions. Differences can be seen at the tails.

Figure 2: Cdfs corresponding to the link functions. The similarities are around the center of the
distributions. Differences can be seen at the tails.


Throughout this work, we perform model comparison and model selection using both the Akaike
Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Taking the view that the
ability of an estimator to generalize well over the whole population provides the best measure of
its ultimate utility, we provide extensive comparisons of the performance of each link function
based on its corresponding test error. In the present work, we perform a large number of
simulations in various dimensions using both artificial and real-life data. Our results persistently
reveal the performance indistinguishability of the links in univariate settings, but some sharp
differences begin to appear as the dimension of the input space (the number of variables measured)
increases.

The rest of this paper is organized as follows: Section 2 presents some general definitions, namely
our meaning of the terms predictive equivalence and structural equivalence, along with some
computational demonstrations on simulated and real-life data. This section also clearly describes
our approach to demonstrating/verifying our claimed results. We show in this section that, for
low to moderate dimensional spaces, goodness of fit and predictive performance measures reveal
the equivalence between probit and logit. Section 3 provides our formal proof of the equivalence
of probit and logit. Section 4 reveals that there might be some differences in performance when
the input space becomes very large. Our demonstration in this section is based on the famous
AT&T 57-dimensional Email Spam Data set. Section 5 provides a conclusion and a discussion,
along with insights into extensions of the present work.

II. Definitions, Methodology and Verification 

Throughout this work, we compare models on the merits of both goodness of fit and
predictive performance. With that in mind, we can define equivalence both from a goodness
of fit perspective and from a predictive optimality perspective. From a predictive analytics
perspective for instance, an important question to ask is: given a randomly selected vector $x$,
what is the probability that the prediction made by probit will differ from the one made by logit?
In other words, how often do the probit and logit link functions yield different predictions? This
is particularly important in predictive analytics in data mining and machine learning, where
the nonparametric nature of most models forces the experimenter to focus on the utility of the
estimator rather than its form. We respond to this need by defining what we call the
$100(1 - \alpha)\%$ predictive equivalence.


II.1 Basic definitions and results

Definition 1. (Binary classifier) Given an input space $\mathcal{X}$ and a binary response space
$\mathcal{Y} = \{0,1\}$, we define a (binary) classifier $h$ to be a function that maps elements of
$\mathcal{X}$ to $\{0,1\}$, or more specifically

$$h : \mathcal{X} \to \{0,1\}, \qquad x \mapsto h(x).$$

In the generalized linear model (GLM) framework, given a link function with corresponding cdf
$F(\cdot)$, a binary classifier $h$ under the majority rule takes the form

$$h(x) = \frac{1}{2}\left[1 + \mathrm{sign}\left(\pi(x) - \frac{1}{2}\right)\right],$$

where $\pi(x) = \Pr[Y = 1 \mid x] = F(\eta(x))$ and $\eta(x)$ is the linear component. For instance,
the logit binary classifier is given by

$$h_{\mathrm{logit}}(x) = \frac{1}{2}\left[1 + \mathrm{sign}\left(\Lambda(\eta(x)) - \frac{1}{2}\right)\right],$$

and the probit binary classifier is given by

$$h_{\mathrm{probit}}(x) = \frac{1}{2}\left[1 + \mathrm{sign}\left(\Phi(\eta(x)) - \frac{1}{2}\right)\right],$$

where $\Lambda(\cdot)$ and $\Phi(\cdot)$ are as defined in Table 1.
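The majority-rule classifier of Definition 1 can be sketched in a few lines. The code below is only an illustration in Python: the helper names and the coefficient vector `beta` are hypothetical and are not taken from the paper.

```python
import math

def sign(t):
    """Return -1, 0 or 1 according to the sign of t."""
    return (t > 0) - (t < 0)

def logit_cdf(u):
    return 1.0 / (1.0 + math.exp(-u))

def probit_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def make_classifier(cdf, beta):
    """Build h(x) = (1 + sign(F(eta(x)) - 1/2)) / 2 with eta(x) = beta_0 + sum_j beta_j x_j.

    When F(eta(x)) equals 1/2 exactly, the formula returns 0.5; a tie-breaking
    rule would be needed in practice (an assumption left open here).
    """
    def h(x):
        eta = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
        return (1 + sign(cdf(eta) - 0.5)) / 2
    return h

# Hypothetical coefficients, used with both links purely for illustration.
beta = [-1.0, 0.8, 0.5]
h_logit = make_classifier(logit_cdf, beta)
h_probit = make_classifier(probit_cdf, beta)

for x in ([0.0, 0.0], [1.0, 1.0], [3.0, -2.0], [-1.0, 4.0]):
    print(x, h_logit(x), h_probit(x))
```

With the same linear component $\eta(x)$, both classifiers return the same label, because $\Lambda$ and $\Phi$ both cross $1/2$ at $\eta = 0$.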

We shall measure the predictive performance of a classifier $h$ by choosing a loss function
$\ell(\cdot,\cdot)$ and then computing the expected loss (also known as the risk functional) $R(h)$
as follows:

$$R(h) = E[\ell(Y, h(X))] = \int_{\mathcal{X}\times\mathcal{Y}} \ell(y, h(x))\, p(x,y)\, dx\, dy.$$

Under the zero-one loss function $\ell(Y, h(X)) = 1_{\{Y \neq h(X)\}}$, the risk functional $R(h)$ is
the misclassification rate, more specifically

$$R(h) = E[\ell(Y, h(X))] = \int_{\mathcal{X}\times\mathcal{Y}} \ell(y, h(x))\, p(x,y)\, dx\, dy
       = \Pr[Y \neq h(X)].$$

In practice, $R(h)$ cannot be computed in closed form because the distribution of $(X, Y)$ is
unknown. We shall therefore use the so-called average test error, or average empirical prediction
error, as our predictive performance measure to compare classifiers.

Definition 2. (Average Test Error) Given a sample $\{(x_i, y_i), i = 1, \ldots, n\}$, we randomly
form a training set $\{(x_i^{(tr)}, y_i^{(tr)}), i = 1, \ldots, n_{tr}\}$ and a test set
$\{(x_i^{(te)}, y_i^{(te)}), i = 1, \ldots, n_{te}\}$. We typically run $R = 1000$ replications of
this split, with 2/3 of the data allocated to the training set and 1/3 to the test set. The test
error here under the symmetric zero-one loss is given by

$$R_{\mathrm{test}}(h) = \mathrm{TE}(h)
  = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} 1_{\{y_i^{(te)} \neq h(x_i^{(te)})\}}
  = \frac{\#\{y_i^{(te)} \neq h(x_i^{(te)})\}}{n_{te}},$$

from which the average test error of $h$ over $R$ random splits of the data is given by

$$\mathrm{ATE}(h) = \frac{1}{R} \sum_{r=1}^{R} \mathrm{TE}_r(h),$$

where $\mathrm{TE}_r(h)$ is the test error yielded by $h$ on the $r$th split of the data.
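The replicated train/test procedure of Definition 2 can be sketched as follows. This is a minimal illustration, assuming a generic `fit` callable that maps a training set to a classifier; the toy learner `fit_midpoint` and the toy data are hypothetical stand-ins for the GLM fits used in the paper.

```python
import random

def split_test_error(h, test_set):
    """Fraction of test pairs (x, y) that the classifier h mislabels."""
    return sum(1 for x, y in test_set if h(x) != y) / len(test_set)

def average_test_error(data, fit, n_splits=1000, train_frac=2 / 3, seed=0):
    """Average the test error over n_splits random train/test splits."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(data)))
    errors = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)                      # random split of the sample
        train, test = shuffled[:n_train], shuffled[n_train:]
        h = fit(train)                             # refit on each training set
        errors.append(split_test_error(h, test))
    return sum(errors) / len(errors)

def fit_midpoint(train):
    """Toy learner (hypothetical): threshold at the midpoint of the class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:                         # degenerate split: one class only
        only = 1 if xs1 else 0
        return lambda x: only
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > cut else 0

# Toy one-dimensional data set: label 1 on the upper half of [0, 1].
data = [(i / 29, 1 if i >= 15 else 0) for i in range(30)]
print(average_test_error(data, fit_midpoint, n_splits=200))
```

Passing a different `fit` function for each link function and reusing the same `seed` would compare the links on identical splits, which is one reasonable design choice among several.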

Definition 3. (Predictively concordant classifiers) Let $h_1$ and $h_2$ be two classifiers defined on
the same $p$-dimensional input space $\mathcal{X}$. We shall say that $h_1$ and $h_2$ are
$100(1 - \alpha)\%$ predictively concordant if, for all $X \in \mathcal{X}$ drawn according to the
density $p_X(x)$,

$$\Pr[h_1(X) \neq h_2(X)] = \alpha.$$

In other words, $h_1$ and $h_2$ are $100(1 - \alpha)\%$ predictively concordant if the probability of
disagreement between the two classifiers is $\alpha$. When $\alpha = 0$, we say that $h_1$ and $h_2$
are perfectly predictively concordant.

Definition 4. (Predictively equivalent classifiers) Let $h_1$ and $h_2$ be two classifiers defined on
the same $p$-dimensional input space $\mathcal{X}$. We shall say that $h_1$ and $h_2$ are
predictively equivalent if the difference between their average test errors is negligible, i.e.,
$\mathrm{ATE}(h_1) \approx \mathrm{ATE}(h_2)$.

Lemma 1. If $X \sim \mathrm{Logistic}(0,1)$ and $Y = \sqrt{\pi/8}\, X$, then $Y$ is approximately
distributed as $N(0,1)$.

Demonstration: Figure 3 below shows that the scaled version of the logistic cdf lines up almost
perfectly with the standard normal cdf.



Figure 3: Scaled version of the logistic CDF superposed on the standard normal CDF. 
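The visual agreement in Figure 3 can also be checked numerically. The sketch below assumes that the scale factor in Lemma 1 is $\sqrt{\pi/8} \approx 0.627$, which is consistent with the slope near 0.6 reported in the paper; if $X \sim \mathrm{Logistic}(0,1)$ and $Y = cX$, then comparing the cdf of $Y$ with $\Phi$ amounts to comparing $\Lambda(x)$ with $\Phi(cx)$. The grid bounds and function names are illustrative choices.

```python
import math

SCALE = math.sqrt(math.pi / 8.0)   # assumed scale factor, about 0.6267

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_abs_gap(lo=-10.0, hi=10.0, steps=4001):
    """Largest |Lambda(x) - Phi(SCALE * x)| over an evenly spaced grid."""
    gap = 0.0
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        gap = max(gap, abs(logistic_cdf(x) - normal_cdf(SCALE * x)))
    return gap

print(max_abs_gap())
```

The two curves agree exactly at zero and differ by less than about two percentage points elsewhere, with the largest gaps in the tails.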
Lemma 2. Let $\Phi(\cdot)$ denote the standard normal cdf. Then

$$\mathrm{sign}\left(\Phi(z) - \frac{1}{2}\right)
  = \mathrm{sign}\left(\Phi\left(\sqrt{\pi/8}\, z\right) - \frac{1}{2}\right).$$

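Theorem 1. The probit and logit link functions are perfectly predictively concordant. Specifically,
given an input space $\mathcal{X}$ and a density $p_X(x)$ on $\mathcal{X}$,

$$\Pr[h_{\mathrm{logit}}(X) \neq h_{\mathrm{probit}}(X)] = 0,$$

for all $X \in \mathcal{X}$ drawn according to $p_X(x)$.

Proof. For a given $X$, let $E$ be the event

$$E = \left\{ \mathrm{sign}\left(\Lambda(\eta(X)) - \frac{1}{2}\right)
      \neq \mathrm{sign}\left(\Phi(\eta(X)) - \frac{1}{2}\right) \right\}.$$

We must show that $\delta = \Pr[E] = 0$. Based on Lemma 1, we can write $E$ as

$$E = \left\{ \mathrm{sign}\left(\Phi\left(\sqrt{\pi/8}\, \eta(X)\right) - \frac{1}{2}\right)
      \neq \mathrm{sign}\left(\Phi(\eta(X)) - \frac{1}{2}\right) \right\}.$$

Thanks to Lemma 2, it is straightforward to see that $\delta = \Pr[E] = 0$. In fact, since
$\Lambda(0) = \Phi(0) = 1/2$ and both cdfs are strictly increasing, each side has the sign of
$\eta(X)$. □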

Definition 5. Let $M_1$ and $M_2$ be two binary regression models based on two different link
functions defined on the same $p$-dimensional input space. We shall say that $M_1$ and $M_2$ are
structurally equivalent if there exists a nonzero real constant $\lambda \in \mathbb{R}^*$ such that
$\beta_j^{(1)} = \lambda \beta_j^{(2)}$ for all $j = 1, \ldots, p$. In other words, the parameters of
$M_1$ are just a scaled version of the parameters of $M_2$, so that knowing the parameters of
$M_1$ is sufficient to completely determine the parameters of $M_2$, and vice versa.

Theorem 2. The logit and probit models are structurally equivalent.

Proof. Thanks to Lemma 1, we can write

$$\Lambda\left(x^\top \beta^{(\mathrm{logit})}\right)
  \approx \Phi\left(\sqrt{\pi/8}\; x^\top \beta^{(\mathrm{logit})}\right)
  = \Phi\left(x^\top \beta^{(\mathrm{probit})}\right),$$

where

$$\beta^{(\mathrm{probit})} = \sqrt{\pi/8}\; \beta^{(\mathrm{logit})}.$$

We have therefore found a nonzero real constant $\lambda = \sqrt{\pi/8}$ such that
$\beta^{(\mathrm{probit})} \approx \lambda \beta^{(\mathrm{logit})}$. □


II.2 Computational Verification via Simulation

To get deeper into how strongly related the probit and logit models are, we now seek to estimate
via simulation the constant coefficient that relates their parameter estimates. Indeed, we
conjecture that $\hat\beta^{(\mathrm{logit})}$ and $\hat\beta^{(\mathrm{probit})}$ are linearly
related via the regression equation

$$\hat\beta^{(\mathrm{probit})} = \tau + \theta \hat\beta^{(\mathrm{logit})} + \nu,$$

where $\tau$ is the intercept and $\nu$ is the noise term. To estimate one instance of $\theta$, we
generate $S$ random replications of the dataset, and for each replication we estimate a copy of
$\beta$ under each model. With these estimates we also compute an estimate of
$\rho = \mathrm{cor}(\hat\beta^{(\mathrm{probit})}, \hat\beta^{(\mathrm{logit})})$, the correlation
coefficient between $\hat\beta^{(\mathrm{probit})}$ and $\hat\beta^{(\mathrm{logit})}$. By repeating
the estimation $R$ times, we gather data to determine the central tendency of $\theta$ and the
corresponding correlation.


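For r = 1 to R

    For s = 1 to S

        * Generate a replicate of the random sample $\{(x_i, y_i), i = 1, \ldots, n\}$

        * Estimate the logit and probit model coefficients $\hat\beta_s^{(\mathrm{logit})}$ and
          $\hat\beta_s^{(\mathrm{probit})}$

    End

    - Store the simulated data
      $D^{(r)} = \{(\hat\beta_s^{(\mathrm{logit})}, \hat\beta_s^{(\mathrm{probit})}), s = 1, \ldots, S\}$

    - Fit the regression model
      $M^{(r)}: \hat\beta_s^{(\mathrm{probit})} = \tau + \theta \hat\beta_s^{(\mathrm{logit})} + \nu_s$
      using $D^{(r)}$

    - Extract the coefficient $\hat\theta^{(r)}$ from $M^{(r)}$

    - Compute $\hat\rho^{(r)}$, the estimate of the correlation between
      $\hat\beta^{(\mathrm{probit})}$ and $\hat\beta^{(\mathrm{logit})}$

End

Collect $\{\hat\theta^{(r)}$ and $\hat\rho^{(r)}, r = 1, \ldots, R\}$, then compute relevant statistics.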
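The simulation loop above can be sketched in Python for the simplest setting. This is an illustrative sketch under stated assumptions, not the authors' code (the paper reports using R): the models are univariate with no intercept, the data are generated from a logit model with a hypothetical true slope, and the fits use Newton-Raphson (logit) and Fisher scoring (probit) written by hand. All function names and default values are choices made for this example.

```python
import math
import random

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def logistic_cdf(u):
    return 1.0 / (1.0 + math.exp(-u))

def fit_logit(xs, ys, iters=50):
    """Newton-Raphson for the no-intercept logit model Pr[Y=1|x] = Lambda(beta * x)."""
    beta = 0.0
    for _ in range(iters):
        score = info = 0.0
        for x, y in zip(xs, ys):
            p = logistic_cdf(beta * x)
            score += x * (y - p)
            info += x * x * p * (1.0 - p)
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta

def fit_probit(xs, ys, iters=50):
    """Fisher scoring for the no-intercept probit model Pr[Y=1|x] = Phi(beta * x)."""
    beta = 0.0
    for _ in range(iters):
        score = info = 0.0
        for x, y in zip(xs, ys):
            u = beta * x
            p = min(max(normal_cdf(u), 1e-12), 1.0 - 1e-12)  # guard against 0 and 1
            d = normal_pdf(u)
            score += x * d * (y - p) / (p * (1.0 - p))
            info += x * x * d * d / (p * (1.0 - p))
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta

def ols_slope_and_corr(a, b):
    """Slope theta of the least-squares line b = tau + theta * a, and cor(a, b)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return sab / saa, sab / math.sqrt(saa * sbb)

def simulate(R=3, S=30, n=150, true_beta=1.0, seed=1):
    """Return one (theta_hat, rho_hat) pair per outer replication r = 1..R."""
    rng = random.Random(seed)
    results = []
    for _ in range(R):
        logit_hat, probit_hat = [], []
        for _ in range(S):
            xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
            ys = [1 if rng.random() < logistic_cdf(true_beta * x) else 0 for x in xs]
            logit_hat.append(fit_logit(xs, ys))
            probit_hat.append(fit_probit(xs, ys))
        results.append(ols_slope_and_corr(logit_hat, probit_hat))
    return results

for theta_hat, rho_hat in simulate():
    print(round(theta_hat, 3), round(rho_hat, 3))
```

The small values of `R`, `S` and `n` keep the sketch fast; they are far below the replication counts used in the paper, so the printed values should be read as indicative only.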


Example 1: We consider a random sample of $n = 199$ observations $\{(x_i, y_i), i = 1, \ldots, n\}$
where the $x_i$ are equally spaced points in an interval $[a, b]$, that is,
$x_i = a + (i - 1)(b - a)/(n - 1)$, and the $y_i$ are drawn from one of the binary regression
models. For instance, we set the domain of $x_i$ to $[a, b] = [0, 1]$ and generate the $Y_i$'s from
a cauchit model with slope 1/2 and intercept 0, i.e., $Y_i \sim \mathrm{Bernoulli}(\pi(x_i))$, with

$$\pi(x_i) = \Pr[Y_i = 1 \mid x_i]
  = \frac{1}{\pi}\left[\tan^{-1}\left(\frac{x_i}{2}\right) + \frac{\pi}{2}\right].$$

Using $R = 99$ replications each running $S = 199$ random samples, we obtain the results shown in
Figure 4. The most striking finding here is that the estimated coefficient of determination
is roughly equal to 1, indicating that knowledge of the logit coefficient almost entirely
determines the value of the probit coefficient. Hence our claim of structural equivalence between
probit and logit. The value of the slope $\theta$ appears to be in the neighborhood of 0.6.

Example 2: We now consider the famous Pima Indian Diabetes dataset, and obtain parameter
estimates under both the logit and the probit models. The dataset is 7-dimensional, with
$x_1$ = npreg, $x_2$ = glu, $x_3$ = bp, $x_4$ = skin, $x_5$ = bmi, $x_6$ = ped and $x_7$ = age.
Under the logit model, the probability that patient $i$ has diabetes given the characteristics
$x_i$ is given by

$$\Pr[\mathrm{Diabetes}_i = 1 \mid x_i] = \pi(x_i) = \frac{1}{1 + e^{-\eta(x_i)}},$$

where

$$\eta(x_i) = \beta_0 + \beta_1\,\mathrm{npreg} + \beta_2\,\mathrm{glu} + \beta_3\,\mathrm{bp}
  + \beta_4\,\mathrm{skin} + \beta_5\,\mathrm{bmi} + \beta_6\,\mathrm{ped} + \beta_7\,\mathrm{age}.$$

We obtain the parameter estimates using R, and display their values in the following tables.
Figure 4: (Left) Scatterplot of $\hat\beta^{(\mathrm{probit})}$ against $\hat\beta^{(\mathrm{logit})}$
based on the R replications generated; (Center) Boxplot of the R replications of the estimate of the
coefficient of determination between $\hat\beta^{(\mathrm{probit})}$ and
$\hat\beta^{(\mathrm{logit})}$; (Right) Boxplot of the R replications of the estimate of the slope
$\theta$.


Model     npreg     glu       bp        skin

Probit    0.0592    0.0192    -0.0024   -0.0017
Logit     0.1031    0.0321    -0.0047   -0.0019
Ratio     0.57434   0.5987    0.5181    0.9073

Table 2: Parameter estimates under probit and logit for the Pima Indian Diabetes Data Set


Model     bmi       ped       age

Probit    0.0505    1.0682    0.0249
Logit     0.0836    1.8204    0.0411
Ratio     0.6044    0.5868    0.6064

Table 3: Parameter estimates under probit and logit for the Pima Indian Diabetes Data Set


As can be seen in Tables 2 and 3 above, the ratio of the probit coefficient over the logit
coefficient is still a number around 0.6 for almost all the parameters. Indeed, the relationship

$$\hat\beta^{(\mathrm{probit})} \approx \tau + \theta \hat\beta^{(\mathrm{logit})} + \nu$$

appears to still hold true. The deviation from that pattern observed in the variable skin is
probably due to the extreme outlier in its distribution. It is important to note that although our
theoretical justification was built under the simplified setting of a univariate model with no
intercept, the relationship uncovered still holds true in a complete multivariate setting, with
each predictor variable obeying the same relationship.

Example 3: We also consider the benchmark Crabs Leptograpsus dataset, and obtain parameter
estimates under both the logit and the probit models. The dataset is 5-dimensional, with
$x_1$ = FL, $x_2$ = RW, $x_3$ = CL, $x_4$ = CW and $x_5$ = BD. Under the logit model, the
probability that the sex of crab $i$ is male given its characteristics $x_i$ is given by

$$\Pr[\mathrm{sex}_i = 1 \mid x_i] = \pi(x_i) = \frac{1}{1 + e^{-\eta(x_i)}},$$

where

$$\eta(x_i) = \beta_0 + \beta_1\,\mathrm{FL} + \beta_2\,\mathrm{RW} + \beta_3\,\mathrm{CL}
  + \beta_4\,\mathrm{CW} + \beta_5\,\mathrm{BD}.$$

We obtain the parameter estimates using R, and display their values in the following table.


Model     FL        RW         CL       BD

Probit    -3.5572   -11.4801   5.5364   1.6651
Logit     -6.1769   -19.9569   9.6643   2.8927
Ratio     0.5758    0.5752     0.5728   0.5756

Table 4: Parameter estimates under probit and logit for the Crabs Leptograpsus Data Set


As can be seen in Table 4 above, the estimate $\hat\theta$ of the ratio $\theta$ of the probit
coefficient over the logit coefficient is still a number around 0.6 for almost all the parameters.
Indeed, the relationship

$$\hat\beta^{(\mathrm{probit})} \approx \tau + \theta \hat\beta^{(\mathrm{logit})} + \nu$$

appears to still hold true. It is important to note that although our theoretical justification was
built under the simplified setting of a univariate model with no intercept, the relationship
uncovered still holds true in a complete multivariate setting, with each predictor variable obeying
the same relationship.


Fact 1. As can be seen from the examples above, the value of $\theta$ lies in the neighborhood of
0.6, regardless of the task under consideration. This supports and confirms our conjecture that
there is a fixed linear relationship between probit coefficients and logit coefficients, to the
point that knowing one implies knowing the other. Hence, the two models are structurally
equivalent. In a sense, wherever logistic regression has been used successfully, probit regression
will do just as good a job. This result confirms what was already noticed and strongly expressed by
Feller (1971) (pp. 52-53).


II.3 Likelihood-based verification of structural equivalence

In the proofs presented earlier, we focused on the parameters and never mentioned their
estimates. We now provide a likelihood-based verification of the structural equivalence of probit
and logit. Without loss of generality, we shall focus on the univariate case where the underlying
linear model does not have the intercept $\beta_0$, so that $\eta(x_i) = \beta x_i$. With $x_i$
denoting the predictor variable for the $i$th observation, we have the probability model
$\Pr[Y_i = 1 \mid x_i] = \pi(x_i) = F(\eta(x_i)) = F(\beta x_i)$. Let
$\hat\beta^{(\mathrm{logit})}$ and $\hat\beta^{(\mathrm{probit})}$ denote the estimates of $\beta$
for the logit and the probit link functions respectively. Our first verification of the equivalence
of the above link functions consists of showing that $\hat\beta^{(\mathrm{logit})}$ and
$\hat\beta^{(\mathrm{probit})}$ are linearly related through
$\hat\beta^{(\mathrm{probit})} = \tau + \theta \hat\beta^{(\mathrm{logit})} + \nu$, with a
coefficient of determination very close to 1 and a slope $\theta$ that remains fixed regardless of
the task at hand. We derive the approximate estimates of $\theta$ theoretically using Taylor series
expansion, but we also confirm their values computationally by simulation.

Theorem 3. Consider an i.i.d. sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where
$x_i \in \mathbb{R}$ is a real-valued predictor variable, and $y_i \in \{0,1\}$ is the corresponding
binary response. First consider fitting the probit model
$\Pr[Y_i = 1 \mid x_i] = \pi(x_i) = \Phi(\beta x_i)$ to the data, and let
$\hat\beta^{(\mathrm{probit})}$ denote the corresponding estimate of $\beta$. Then consider fitting
the logit model $\Pr[Y_i = 1 \mid x_i] = \pi(x_i) = 1/(1 + \exp(-\beta x_i))$ to the data, and let
$\hat\beta^{(\mathrm{logit})}$ denote the corresponding estimate of $\beta$. Then,

$$\hat\beta^{(\mathrm{probit})} \approx 0.625\, \hat\beta^{(\mathrm{logit})}.$$


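Proof. Given an i.i.d. sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and the model
$\Pr[Y_i = 1 \mid x_i] = \pi(x_i)$, the log-likelihood for $\beta$ is given by

$$\ell(\beta) = \log L(\beta)
  = \sum_{i=1}^{n} \left\{ y_i \log \pi(x_i) + (1 - y_i) \log(1 - \pi(x_i)) \right\}.$$

Under the logit link function, we have $\pi(x_i) = 1/(1 + e^{-\beta x_i})$. Now, using a Taylor
series expansion around zero for the two most important parts of the log-likelihood function, we get

$$\frac{\partial \log \pi(x_i)}{\partial \beta}
  = \frac{x_i}{2} - \frac{x_i^2}{4}\beta + \frac{x_i^4}{48}\beta^3 - \frac{x_i^6}{480}\beta^5 + \cdots$$

$$\frac{\partial \log(1 - \pi(x_i))}{\partial \beta}
  = -\frac{x_i}{2} - \frac{x_i^2}{4}\beta + \frac{x_i^4}{48}\beta^3 - \frac{x_i^6}{480}\beta^5 + \cdots$$

The derivative of the approximate log-likelihood function for the logit model is then given by

$$\ell'(\beta) \approx \sum_{i=1}^{n} \left\{
    y_i \left( \frac{x_i}{2} - \frac{x_i^2}{4}\beta + \frac{x_i^4}{48}\beta^3 - \frac{x_i^6}{480}\beta^5 \right)
  + (1 - y_i) \left( -\frac{x_i}{2} - \frac{x_i^2}{4}\beta + \frac{x_i^4}{48}\beta^3 - \frac{x_i^6}{480}\beta^5 \right)
  \right\},$$

which, upon ignoring the higher degree terms in the expansion, becomes

$$\ell'(\beta) \approx \frac{1}{4} \sum_{i=1}^{n} \left( 4 y_i x_i - 2 x_i - x_i^2 \beta \right).$$

It is straightforward to see that solving $\ell'(\beta) = 0$ for $\beta$ yields

$$\hat\beta^{(\mathrm{logit})} \approx
  \frac{2 \left( 2 \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \right)}{\sum_{i=1}^{n} x_i^2}.$$

If we now consider the probit link function, we have
$\pi(x_i) = \Phi(\beta x_i) = \int_{-\infty}^{\beta x_i} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}\, dz$.
Using a derivation similar to the one performed earlier, and ignoring higher order terms, we get

$$\ell'(\beta) \approx \sum_{i=1}^{n} \left\{ y_i \left( c_1 x_i - 2 c_2 \beta x_i^2 \right) \right\}
  + \sum_{i=1}^{n} \left\{ (1 - y_i) \left( -c_1 x_i - 2 c_2 \beta x_i^2 \right) \right\}
  = \sum_{i=1}^{n} \left( 2 c_1 x_i y_i - c_1 x_i - 2 c_2 \beta x_i^2 \right),$$

where $c_1 = 0.797885$ and $c_2 = 0.31831$. This leads to

$$\hat\beta^{(\mathrm{probit})} \approx \frac{c_1}{2 c_2} \cdot
  \frac{2 \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2}.$$

It is then straightforward to see that

$$\frac{\hat\beta^{(\mathrm{probit})}}{\hat\beta^{(\mathrm{logit})}} \approx \frac{c_1}{4 c_2} \approx 0.625,$$

or equivalently

$$\hat\beta^{(\mathrm{probit})} \approx 0.625\, \hat\beta^{(\mathrm{logit})}.$$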

It must be emphasized that the above likelihood-based theoretical verifications depend
on Taylor series approximations of the likelihood, and therefore the factors of proportionality are
bound to be inexact. It is reassuring, however, to see that our computational verification
confirms the results found by theoretical derivation.
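The two first-order approximations in the proof, $\hat\beta^{(\mathrm{logit})} \approx 2(2\sum x_i y_i - \sum x_i)/\sum x_i^2$ and $\hat\beta^{(\mathrm{probit})} \approx \frac{c_1}{2c_2}(2\sum x_i y_i - \sum x_i)/\sum x_i^2$, can be evaluated directly to see that their ratio does not depend on the data. The sketch below assumes that the quoted constants $c_1 = 0.797885$ and $c_2 = 0.31831$ are $\sqrt{2/\pi}$ and $1/\pi$, with which they agree to the digits shown; under that assumption the ratio $c_1/(4c_2)$ equals $\sqrt{\pi/8} \approx 0.6267$, close to the rounded 0.625 quoted in Theorem 3. The toy data and function names are hypothetical.

```python
import math

C1 = math.sqrt(2.0 / math.pi)   # assumed exact value of c1, about 0.797885
C2 = 1.0 / math.pi              # assumed exact value of c2, about 0.318310

def _sums(xs, ys):
    """Return (sum x*y, sum x, sum x^2) for the sample."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return sxy, sx, sxx

def approx_logit_beta(xs, ys):
    """First-order Taylor approximation of the no-intercept logit estimate."""
    sxy, sx, sxx = _sums(xs, ys)
    return 2.0 * (2.0 * sxy - sx) / sxx

def approx_probit_beta(xs, ys):
    """First-order Taylor approximation of the no-intercept probit estimate."""
    sxy, sx, sxx = _sums(xs, ys)
    return (C1 / (2.0 * C2)) * (2.0 * sxy - sx) / sxx

# Hypothetical toy sample, used only to exercise the formulas.
xs = [-1.0, -0.5, 0.5, 1.0, 1.5]
ys = [0, 0, 1, 1, 1]

ratio = approx_probit_beta(xs, ys) / approx_logit_beta(xs, ys)
print(ratio, C1 / (4.0 * C2), math.sqrt(math.pi / 8.0))
```

These closed forms are only accurate when $\beta x_i$ stays close to zero, since they come from truncating the Taylor expansions after the linear term.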


III. Similarities and Differences beyond Logit and Probit 

Other aspects of our work reveal that the similarities proved and demonstrated above between
the probit and the logit link functions extend predictively to the other link functions mentioned
above. As far as structural equivalence, or the lack thereof, is concerned, Appendix A contains
similar derivations for the relationship between cauchit and logit, and the relationship between
compit and logit. As far as predictive equivalence is concerned, we now present a verification
based on the computation of many replications of the test error.

III.1 Computational Verification of Predictive Equivalence

We now computationally compare the predictive merits of each of the four link functions considered
so far. To this end, we compare the estimated average test error yielded by the four link
functions. We do so by running $R = 10000$ replications of the split of the data set into training
and test sets, and at each iteration we compute the corresponding test error for the classifier
corresponding to each link function. For one iteration/replication for instance,
$R_{\mathrm{test}}(\hat f^{(\mathrm{probit})})$, $R_{\mathrm{test}}(\hat f^{(\mathrm{compit})})$,
$R_{\mathrm{test}}(\hat f^{(\mathrm{cauchit})})$ and $R_{\mathrm{test}}(\hat f^{(\mathrm{logit})})$
are the values of the test error generated by probit, compit, cauchit and logit respectively. After
R replications, we have R random realizations of each of those four test errors. We then perform
various statistical calculations on the R replications, namely median, mean, standard deviation,
kurtosis, skewness, IQR, etc., to assess the similarities and the differences among the link
functions. We perform similar R replications for model comparison using both AIC and BIC.
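The summary statistics computed over the R replicated test errors can be sketched as below. Two conventions are assumptions of this sketch rather than statements from the paper: kurtosis is taken as the non-excess moment ratio $m_4/m_2^2$ (so that a normal sample gives about 3), and the coefficient of variation is expressed in percent. The function name and the toy input are illustrative.

```python
import statistics

def summarize(errors):
    """Summary statistics of a list of replicated test errors."""
    n = len(errors)
    mean = statistics.fmean(errors)
    sd = statistics.stdev(errors)                   # sample standard deviation
    m2 = sum((e - mean) ** 2 for e in errors) / n   # central moments
    m3 = sum((e - mean) ** 3 for e in errors) / n
    m4 = sum((e - mean) ** 4 for e in errors) / n
    q1, _, q3 = statistics.quantiles(errors, n=4)   # quartiles
    return {
        "median": statistics.median(errors),
        "mean": mean,
        "sd": sd,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,                   # non-excess kurtosis (assumption)
        "cv": 100.0 * sd / mean,                    # in percent (assumption)
        "IQR": q3 - q1,
        "min": min(errors),
        "max": max(errors),
    }

# Toy input standing in for R replicated test errors of one link function.
toy = [0.1, 0.2, 0.3, 0.4, 0.5]
for name, value in summarize(toy).items():
    print(name, value)
```

Calling `summarize` once per link function on the same set of splits gives one column of a comparison table for each link.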

Example 4: Verification of Predictive Equivalence on Artificial Data:
$\{(x_i, y_i), i = 1, \ldots, n\}$ where $x_i \sim \mathrm{Normal}(0, 2^2)$ and $y_i \in \{0,1\}$ are
drawn from a cauchit binary regression model with $\beta_0 = 1$ and $\beta_1 = 2$, namely
$Y_i \sim \mathrm{Bernoulli}(\pi(x_i))$ where

$$\pi(x_i) = \Pr[Y_i = 1 \mid x_i]
  = \frac{1}{\pi}\left[\tan^{-1}(1 + 2 x_i) + \frac{\pi}{2}\right].$$

Table 5 shows some statistics on $R = 10000$ replications of the test error. These results
suggest that the four link functions are almost indistinguishable, as the estimated statistics are
almost all equal across the four link functions.



            probit   compit   cauchit   logit

median      0.16     0.16     0.16      0.16
mean        0.16     0.16     0.16      0.16
sd          0.04     0.04     0.03      0.04
skewness    0.21     0.26     0.26      0.24
kurtosis    3.18     3.51     3.20      3.20
cv          22.56    22.46    22.25     22.57
IQR         0.05     0.04     0.05      0.05
min         0.06     0.04     0.06      0.06
max         0.30     0.32     0.31      0.30

Table 5: Statistics based on R = 10000 replicates of the test error on the artificial data set
described above. It is clear that the values are indistinguishable across the four link functions.


Example 5: Verification of Predictive Equivalence on the Pima Indian Diabetes Dataset: We once again
consider the famous Pima Indian Diabetes dataset, which is arguably one of the most used benchmark
data sets in the statistics and pattern recognition community. As can be seen in Table 6, there is
virtually no difference between the models. In other words, on the Pima Indian Diabetes data set,
the four link functions are predictively equivalent.

It is also noteworthy that all four models yield similar goodness of fit measures when scored using
AIC and BIC. Indeed, Figure 5 reveals that over the $R = 10000$ replications of the split of the
data into training and test sets, both the AIC and BIC are distributionally similar across all four
link functions. Despite the slight difference shown by the cauchit model, it is fair to say that all
the link functions are equivalent in terms of goodness of fit. Once again, this is further evidence
supporting Feller (1971)'s claim that all these link functions are equivalent in terms of goodness
of fit, and that the over-glorification of the logit model is at best misguided, if not unfounded.


III.2 Evidence of Differences in High Dimensional Spaces 

Simulated evidence: We generate $s = 10000$ observations in the interval $[-15, 15]$. For each link
function, we compute the sign of $F(x_i) - 1/2$ for $i = 1, \ldots, s$. We then generate a table
containing the percentage of times the signs differ.
            probit   compit   cauchit   logit

median      0.25     0.24     0.25      0.25
mean        0.25     0.25     0.26      0.25
sd          0.04     0.04     0.05      0.04
skewness    0.06     0.07     0.06      0.06
kurtosis    2.92     2.95     2.95      2.92
cv          17.84    18.33    17.62     17.85
IQR         0.06     0.06     0.07      0.06
min         0.09     0.07     0.10      0.09
max         0.43     0.40     0.45      0.42

Table 6: Statistics based on R = 10000 replicates of the test error on the Pima Indian Diabetes data
set. It is quite obvious that the values are indistinguishable across the four link functions.


[Figure 5 panels: AIC (left) and BIC (right); each panel shows boxplots for probit, compit, cauchit
and logit.]

Figure 5: Comparative boxplots of both AIC and BIC across the four link functions based on R = 10000
replications of model fittings.



            probit   compit   cauchit   logit

probit      0.000    0.004    0.000     0.000
compit      0.004    0.000    0.004     0.004
cauchit     0.000    0.004    0.000     0.000
logit       0.000    0.000    0.000     0.000

Table 7: All the pairs reveal a disagreement of 0% except the pairs involving the compit.
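The sign-comparison experiment summarized in Table 7 can be sketched as follows. How the paper generates its $s$ points is not stated, so drawing them uniformly on $[-15, 15]$ is an assumption of this sketch, and the resulting proportions need not match the table. The dictionary `CDFS` and the function names are illustrative.

```python
import math
import random

CDFS = {
    "probit": lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))),
    "compit": lambda u: 1.0 - math.exp(-math.exp(u)),
    "cauchit": lambda u: (math.atan(u) + math.pi / 2.0) / math.pi,
    "logit": lambda u: 1.0 / (1.0 + math.exp(-u)),
}

def sign(t):
    """Return -1, 0 or 1 according to the sign of t."""
    return (t > 0) - (t < 0)

def disagreement_table(s=10000, lo=-15.0, hi=15.0, seed=0):
    """Proportion of points where sign(F(x) - 1/2) differs, for every pair of links."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(s)]   # uniform draws: an assumption
    signs = {name: [sign(F(x) - 0.5) for x in xs] for name, F in CDFS.items()}
    names = list(CDFS)
    return {
        (a, b): sum(u != v for u, v in zip(signs[a], signs[b])) / s
        for a in names
        for b in names
    }

table = disagreement_table()
for a in CDFS:
    print(a, [round(table[(a, b)], 4) for b in CDFS])
```

The probit, cauchit and logit cdfs all equal $1/2$ at zero, so their signs agree everywhere; the compit cdf equals $1/2$ at $\log(\log 2) \approx -0.37$, which is why only pairs involving the compit can disagree, and only on points falling between that value and zero.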
Computational Demonstrations on the Email Spam Data: Unlike all the other data sets encountered
thus far, the email spam data set is a fairly high dimensional data set. It has a total of $p = 57$
variables and $n = 4601$ observations.



            probit   compit   cauchit   logit

median      0.08     0.13     0.07      0.07
mean        0.10     0.13     0.07      0.08
sd          0.03     0.03     0.04      0.01
skewness    1.75     0.88     9.16      4.86
kurtosis    9.29     8.56     103.95    40.58
cv          34.41    20.61    51.15     14.32
IQR         0.04     0.04     0.01      0.01
min         0.06     0.07     0.05      0.06
max         0.41     0.38     0.62      0.18

Table 8: Email Spam Data Set Results


Clearly, the results in Table 8 reveal some drastic differences in performance among 
the four link functions on this rather high dimensional data. The boxplots below reinforce these 
findings: in terms of goodness of fit measured through AIC and BIC, the compit 
model deviates substantially from the other models. 


[Boxplot panels: "AIC on Email Spam Data Set" and "BIC on Email Spam Data Set"; horizontal axis: probit, compit, cauchit, logit] 


Figure 6: Comparative Boxplots assessing the goodness of fit of the four link functions using AIC and BIC over 
R = 10000 replications of model fitting under each of the link functions. 
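
As a reminder of what is being compared in these boxplots, the sketch below computes AIC and BIC from a fitted binary model using the standard definitions $\mathrm{AIC} = 2k - 2\hat{\ell}$ and $\mathrm{BIC} = k \log n - 2\hat{\ell}$, where $\hat{\ell}$ is the maximized log-likelihood and $k$ the number of estimated coefficients. The function names are hypothetical, and taking $k = 58$ for the spam data (57 predictors plus an intercept) is an assumption about how the models were specified.

```python
import math

def bernoulli_loglik(ys, probs, eps=1e-12):
    """Log-likelihood of binary responses ys given fitted probabilities probs."""
    total = 0.0
    for y, p in zip(ys, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def aic_bic(loglik, k, n):
    """Standard AIC and BIC for a model with k parameters fitted to n observations."""
    aic = 2.0 * k - 2.0 * loglik
    bic = k * math.log(n) - 2.0 * loglik
    return aic, bic
```

Since the four models share the same $k$ and $n$, any difference in AIC or BIC across link functions on a given sample comes entirely from the maximized log-likelihood.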


IV. Conclusion and discussion 

Throughout this paper, we have explored, conceptually, methodologically and computationally, 
the similarities among four of the most commonly used link functions in binary regression. 
We have shed some theoretical light on the structural reasons that explain the 
indistinguishability in performance among the four link functions in the univariate setting. 
Although Section 2 concentrated mainly on the equivalence of the logit and probit, the Appendix 
provides a similar derivation for both the cauchit and the complementary log-log link functions. 
We have also demonstrated by computational simulations that the four link functions are 
essentially equivalent both structurally and predictively in the univariate setting and in low 
dimensional spaces. Our last example showed computationally that the four link functions might differ 
quite substantially when the dimension of the input space becomes extremely large. We notice 
specifically that the performance in high dimensional spaces tends to depend on the internal 
structure of the input: completely orthogonal designs tend to bode well for all the perfectly 
symmetric link functions, while non-orthogonal designs deliver their best performance under the 
complementary log-log. Finally, the sparseness of the input space tends to dictate the choice of 
the most appropriate link function, with the cauchit tending to be the model of choice under a high 
level of sparseness. In future work, we intend to provide as complete a theoretical characterization 
as possible in extremely high dimensional spaces, namely the conditions under which 
each of the link functions will yield the best fit for the data. 


References 

Armagan, A. and R. Zaretzki (2011). A note on mean-field variational approximations in bayesian 
probit models. Computational Statistics and Data Analysis 55, 641-643. 

Basu, S. and S. Mukhopadhyay (2000). Bayesian analysis of binary regression using symmetric 
and asymmetric links. Sankhya: The Indian Journal of Statistics 62(3), 372-387. 

Chakraborty, S. (2009). Bayesian binary kernel probit model for microarray based cancer 
classification and gene selection. Computational Statistics and Data Analysis 53, 4198-4209. 

Chambers, E. and D. Cox (1967). Discrimination between alternative binary response models. 
Biometrika 54(3/4), 573-578. 

Csato, L., E. Fokoue, M. Opper, B. Schottky, and O. Winther (2000). Efficient approaches to 
gaussian process classification. In S. A. Solla, T. K. Leen, and K.-R. Müller (Eds.), Advances 
in Neural Information Processing Systems, Number 12. MIT Press. 

Feller, W. (1940). On the logistic law of growth and its empirical verification in biology. Acta 
Biotheoretica 5, 51-66. 

Feller, W. (1971). An Introduction to Probability Theory and Its Applications (Second ed.). Volume II. 
New York: John Wiley and Sons. 

Lin, G. D. and C. Y. Hu (2008). On characterizations of the logistic distribution. Journal of Statistical 
Planning and Inference 138, 1147-1156. 

Nadarajah, S. (2004). Information matrix for logistic distributions. Mathematical and Computer 
Modelling 40, 953-958. 

Nassar, M. M. and A. Elmasry (2012). A study of generalized logistic distributions. Journal of the 
Egyptian Mathematical Society 20, 126-133. 

Schumacher, M., R. Roßner, and W. Vach (1996). Neural networks and logistic regression: Part i. 
Computational Statistics and Data Analysis 21, 661-682. 

Tamura, K. A. and V. Giampaoli (2013). New prediction method for the mixed logistic model 
applied in a marketing problem. Computational Statistics and Data Analysis 66, 202-216. 

van den Hout, A., P. van der Heijden, and R. Gilchrist (2007). The Logistic Regression Model 
with Response Variables Subject to Randomized Response. Computational Statistics and Data 
Analysis 51, 6060-6069. 


Zelterman, D. (1989). Order statistics for the generalized logistic distribution. Computational 
Statistics and Data Analysis 7, 69-77. 


V. Appendix A 

Theorem 4. Consider an i.i.d. sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where $x_i \in \mathbb{R}$ is a real-valued 
predictor variable, and $y_i \in \{0, 1\}$ is the corresponding binary response. First consider fitting the cauchit model 
$\Pr[Y_i = 1 \mid x_i] = \pi(x_i) = \frac{1}{\pi}\left[\tan^{-1}(\beta x_i) + \frac{\pi}{2}\right]$ to the data, and let $\hat{\beta}^{(\mathrm{cauchit})}$ denote the corresponding 
estimate of $\beta$. Then consider fitting the logit model $\Pr[Y_i = 1 \mid x_i] = \pi(x_i) = 1/(1 + \exp(-\beta x_i))$ to 
the data, and let $\hat{\beta}^{(\mathrm{logit})}$ denote the corresponding estimate of $\beta$. Then, 

$$\hat{\beta}^{(\mathrm{cauchit})} \simeq \frac{\pi}{4}\,\hat{\beta}^{(\mathrm{logit})}.$$
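
As an informal numerical illustration (not part of the proof), the sketch below fits both one-parameter models by maximum likelihood on simulated data and compares the two estimates. Under a first-order expansion of the log-likelihood, both estimates are proportional to $(2\sum x_i y_i - \sum x_i)/\sum x_i^2$, with factor $\pi/2$ for the cauchit and factor $2$ for the logit, which gives a ratio of $\pi/4$. The factor $2$ for the logit is obtained here by applying the same expansion to the logistic cdf and is not quoted from the text; the simulation design (uniform $x_i$ on $[-1, 1]$, true logit slope 0.3, bisection on the score) and all function names are assumptions made for illustration.

```python
import math
import random

def logit_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit_pdf(z):
    p = logit_cdf(z)
    return p * (1.0 - p)

def cauchit_cdf(z):
    return 0.5 + math.atan(z) / math.pi

def cauchit_pdf(z):
    return 1.0 / (math.pi * (1.0 + z * z))

def score(beta, xs, ys, cdf, pdf):
    """Derivative of the Bernoulli log-likelihood for Pr[Y=1|x] = F(beta * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = cdf(beta * x)
        total += x * pdf(beta * x) * (y - p) / (p * (1.0 - p))
    return total

def mle(xs, ys, cdf, pdf, lo=-3.0, hi=3.0, iters=60):
    """Root of the score by bisection; assumes score(lo) > 0 > score(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, xs, ys, cdf, pdf) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def first_order(xs, ys, factor):
    """First-order estimate: factor * (2 * sum(x*y) - sum(x)) / sum(x^2)."""
    num = 2.0 * sum(x * y for x, y in zip(xs, ys)) - sum(xs)
    return factor * num / sum(x * x for x in xs)

def simulate(n=5000, beta=0.3, seed=1):
    # Assumed design: small |beta * x| so that the expansion around zero applies.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < logit_cdf(beta * x) else 0 for x in xs]
    return xs, ys

xs, ys = simulate()
b_cauchit = mle(xs, ys, cauchit_cdf, cauchit_pdf)
b_logit = mle(xs, ys, logit_cdf, logit_pdf)
print("ratio of MLEs:", b_cauchit / b_logit, "pi/4:", math.pi / 4)
print("cauchit MLE:", b_cauchit, "first-order:", first_order(xs, ys, math.pi / 2))
```

The agreement is expected to degrade as $|\beta x_i|$ grows, since the approximation relies on an expansion around zero.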

Proof. Given an i.i.d. sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and the model $\Pr[Y_i = 1 \mid x_i] = \pi(x_i)$, the 
loglikelihood for $\beta$ is given by 

$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \left\{ y_i \log \pi(x_i) + (1 - y_i) \log(1 - \pi(x_i)) \right\}. \qquad (5)$$

For the cauchit for instance, $\pi(x_i) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}(\beta x_i)$. We use the Taylor series expansion 
around zero for both $\log \pi(x_i)$ and $\log(1 - \pi(x_i))$: 

$$\log \pi(x_i) = -\log 2 + \frac{2\beta x_i}{\pi} - \frac{2\beta^2 x_i^2}{\pi^2} - \frac{2(\pi^2 - 4)\beta^3 x_i^3}{3\pi^3} + O(x_i^4)$$

and 

$$\log(1 - \pi(x_i)) = -\log 2 - \frac{2\beta x_i}{\pi} - \frac{2\beta^2 x_i^2}{\pi^2} + \frac{2(\pi^2 - 4)\beta^3 x_i^3}{3\pi^3} + O(x_i^4)$$
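
The two cubic expansions of the cauchit log-probabilities can be checked numerically. The short sketch below compares $\log \pi$ and $\log(1 - \pi)$, written as functions of $z = \beta x_i$, with the series $-\log 2 \pm 2z/\pi - 2z^2/\pi^2 \mp 2(\pi^2 - 4)z^3/(3\pi^3)$; the function names are hypothetical and the check is only a sanity test at small $z$.

```python
import math

# Cubic coefficient 2 * (pi^2 - 4) / (3 * pi^3) of the cauchit expansions.
C3 = 2.0 * (math.pi ** 2 - 4.0) / (3.0 * math.pi ** 3)

def log_pi_exact(z):
    return math.log(0.5 + math.atan(z) / math.pi)

def log_pi_series(z):
    return -math.log(2.0) + 2.0 * z / math.pi - 2.0 * z * z / math.pi ** 2 - C3 * z ** 3

def log_one_minus_pi_exact(z):
    return math.log(0.5 - math.atan(z) / math.pi)

def log_one_minus_pi_series(z):
    return -math.log(2.0) - 2.0 * z / math.pi - 2.0 * z * z / math.pi ** 2 + C3 * z ** 3

for z in (0.1, 0.05):
    err1 = abs(log_pi_exact(z) - log_pi_series(z))
    err0 = abs(log_one_minus_pi_exact(z) - log_one_minus_pi_series(z))
    print(z, err1, err0)
```

If the cubic series is correct, halving $z$ should shrink both errors by roughly a factor of 16, in line with a remainder of order $z^4$.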

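A first order approximation of the derivative of the log-likelihood with respect to $\beta$ is 

$$\ell'(\beta) \simeq \sum_{i=1}^{n} \left\{ y_i \left[ \frac{2x_i}{\pi} - \frac{4\beta x_i^2}{\pi^2} \right] + (1 - y_i) \left[ -\frac{2x_i}{\pi} - \frac{4\beta x_i^2}{\pi^2} \right] \right\} = \sum_{i=1}^{n} \left\{ \frac{4 x_i y_i}{\pi} - \frac{2 x_i}{\pi} - \frac{4\beta x_i^2}{\pi^2} \right\}.$$

Solving $\ell'(\beta) = 0$ yields 

$$\hat{\beta}^{(\mathrm{cauchit})} = \frac{\frac{4}{\pi}\sum_{i=1}^{n} x_i y_i - \frac{2}{\pi}\sum_{i=1}^{n} x_i}{\frac{4}{\pi^2}\sum_{i=1}^{n} x_i^2}$$

which simplifies to 

$$\hat{\beta}^{(\mathrm{cauchit})} = \frac{\pi}{2} \cdot \frac{2\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2}.$$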